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Abstract 

This paper proposes and compares two new sampling schemes for sparse deconvolution using a Bernoulli-Gaussian 
model. To tackle such a deconvolution problem in a blind and unsupervised context, the Markov Chain Monte Carlo 
(MCMC) framework is usually adopted, and the chosen sampling scheme is most often the Gibbs sampler. However, 
such a sampling scheme fails to explore the state space efficiently. Our first alternative, the iC-tuple Gibbs sampler, 
is simply a grouped Gibbs sampler. The second one, called partially marginalized sampler, is obtained by integrating 
the Gaussian amplitudes out of the target distribution. While the mathematical validity of the first scheme is obvious 
as a particular instance of the Gibbs sampler, a more detailed analysis is provided to prove the validity of the second 
scheme. 

For both methods, optimized implementations are proposed in terms of computation and storage cost. Finally, 
simulation results validate both schemes as more efficient in terms of convergence time compared with the plain 
Gibbs sampler. Benchmark sequence simulations show that the partially marginalized sampler takes fewer iterations 
to converge than the JC-tuple Gibbs sampler. However, its computation load per iteration grows almost quadratically 
with respect to the data length, while it only grows linearly for the Tf-tuple Gibbs sampler. 
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1. Introduction 

The problem of the restoration of a sparse spike train x distorted by a linear system h and corrupted by 
noise e such that z = x * h + e, arises in many fields such as seismic exploration |1|2| and astronomy [3]. 

In this paper, we adopt a Bernoulli-Gaussian (BG) model for the spike train x, following [2] and many 
subsequent contributions such as [114] , A BG signal is an independent, identically distributed (iid) process 
defined in two stages. Firstly, the sparse nature of the spikes is governed by the Bernoulli law: 

P(q) = \ L (l-\) M - L (1) 

with the Bernoulli sequence q = [qi, . . . , ^jw]* and L = X)m=i9« 1 ' the num ber of non-zero realizations. 
Secondly, amplitudes x = [x\, . . . , xmY are assumed iid zero-mean Gaussian conditionally to q: 

x\q ~ M(0,a 2 x dmg{q)), (2) 

where diag(q) denotes a diagonal matrix whose diagonal is q. 

The MCMC approach |5|6j is a powerful numerical tool, appropriate to solve complex inference problems 
such as blind deconvolution. In the field of blind BG deconvolution, Cheng et al. pioneered the introduction 
of MCMC methods pQ. They proposed to rely on a plain Gibbs sampler, i.e., with a site-by-site updating 
scheme for the spike train, for which their algorithm constitutes a simple and canonical example. However, 
simulation results indicate that it lacks reliability: from different initial conditions, significantly different 
estimations are obtained, even after a considerable number of iterations. 

The recent contribution of [7] already identified a convergence issue linked to time-shift ambiguities, and 
proposed an efficient way to solve it. In addition, the scale ambiguities are treated by j8] by proposing a 
scale re-sampling step to accelerate the convergence rate of the Markov chain. In this study, we point out 
another source of inefficiency, unrelated to the above-mentioned ambiguities: instead of exploring the 2 
configurations of q at an acceptable speed, the Gibbs sampler tends to get stuck for many iterations around 
some particular configurations of q, often corresponding to local optimal configurations of x of the posterior 
distribution as illustrated by the example in Section l2~2l This conclusion meets Bourguignon and Carfantan's 
analysis [3]: the Markov chain equilibrates rapidly around a mode (i.e., a local optimal configuration), but 
takes a long time to move from mode to mode. 

In order to make up for this deficiency, our first proposition is to adopt a grouped Gibbs sampler [5], called 
a K-tuple Gibbs sampler, where blocks of K adjacent BG variables (qi, . . . ,qi + K-i) and their associated 
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amplitudes (xi, . . . , Xi+K-i) are jointly sampled. A comparable idea first appeared in |9j, and more recently 
in [3], under the form of a deterministic iteration aimed at producing an increase of the posterior probability, 
whereas our goal is to sample the latter. 

We then propose a second solution based on sampling the posterior distribution marginally to the ampli- 
tudes x [10]. A comparable idea is found in statistical signal segmentation [UJ where some hyper-parameters 
are partially marginalized. In fact, as Liu pointed out in [5], completely integrating out some components 
(the Gaussian amplitudes x in our case), leads to a more efficient sampling scheme called collapsed Gibbs 
sampling. However, a plain collapsed Gibbs sampling on the marginal posterior distribution involves hardly 
tractable sampling steps. In particular, it is all but simple to sample h conditional on (2, q) and marginally 
with respect to x. Our scheme solves this problem by combining a step that samples q marginally with 
respect to x and other sampling steps involving x. Such a partially marginalized sampler is fully valid from 
the mathematical point of view. In this paper, we show that it can be interpreted as a plain Gibbs sampler 
with a particular scanning order of the variables. 

Simulation tests on toy examples and on the so-called Mendel's sequence [2] confirm the efficiency of both 
methods in terms of computation time before convergence of the Markov chain. Further analysis shows the 
data length as a criterion to choose between the two proposed methods. 

This paper is organized as follows. In Section [21 after a brief formulation of the blind BG deconvolution 
problem, the Gibbs sampler of the joint posterior distribution [TJ is presented, and an example illustrates its 
inefficiency as regards the sampling of q. Sections [3] and [4] respectively introduce the generalized -FT-tuple 
Gibbs sampler and the partially marginalized sampler. In both cases, a toy example is used to evaluate 
the capability of the sampler to escape from local optimal configurations, and implementation issues are 
carefully dealt with. 

Finally, simulation results are presented in Section [5] to compare the efficiency of the sampling schemes 
according to Brooks and Gelman's convergence diagnostic [12] , and conclusions are drawn in Section [6l 

2. Problem formulation 

2.1. Statistical model 

The mathematical model of convolution reads 

p 

z n = ^2 h k x n - k + e„ (3) 

fc=0 
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for all n G {1, . . . , N}, where z ~ [z\, . . . , zn] denotes the observed vector, x = [xi, . . . , xm] is the 
unknown spike train, h = [ho, . . . , hpf the impulse response (IR) of the system (assumed finite here) and 
e = [ei, . . . ,6m] an noise vector, often assumed white stationary Gaussian. The deconvolution problem is 
said blind when h is unknown, which is the studied case here. Akin to [I], the following assumptions are 
made: 

- e ~ 7V(0, of I) is independent of x and h; 

- x is a BG process defined by Eqs.|T])-((2|) with a 2 , = 1; 

- h~M(0,a 2 I P+1 ). 

Let us remark that by setting a 2 arbitrarily to one while imposing the Gaussian prior law h removes the 
scale ambiguity inherent to the blind deconvolution problem [7]- By adopting the "zero boundary" condition 
so that M — N — P, z can be rewritten in the form of a matrix multiplication z = Hx + e, where H 
denote the N x M Toeplitz matrix of convolution. It is also useful to introduce the Toeplitz matrix X of 
size JVx (P + 1) such that Hx = X/i. According to the Monte Carlo principle, a posterior mean estimator 
of 8 = {q, x, h, A, a\ , cr£} given z can be approximated by: 

© = 7^7 E e« (4) 

k=J+l 

where the sum extends over the last I — J samples. In the MCMC framework, the samples are generated 
iteratively, so that asymptotically 8^ follows the joint posterior distribution [6]: 

P(Q | z) cx g(z - Hx: a 2 J N ) g(x; ^diag{g}) g(h; a 2 h l P+l ) P(q; A) P(a 2 h ) P(A) P(a 2 ) (5) 

where <?(•; R) denotes the centered Gaussian density of covariance R. Conjugate prior laws are adopted for 
the last three terms P(cr£),P(A) and P(cr 2 ): 

A ~ Be(l, 1), ^-1(7(1,1), 

where IG and Be respectively represent the inverse Gamma and Beta distributions. The flat shapes of the 
two distributions IG (1,1) and Be(l,l) (a uniform distribution on [0, 1]) can be interpreted as conveying no 
specific prior information of the parameters. 

2.2. Classical Gibbs Sampling 

A Gibbs sampler circumvents the difficulties of directly inferencing from the joint posterior distribution as 
in Eq. © and instead draws each parameter from its conditional law while all others are fixed. The principle 
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of Cheng et a/.'s Gibbs sampler is given in Table Q] and a pseudocode can be found in Table [2 The five 

simulation steps are iterated until convergence toward the posterior distribution in Eq. J5]) , and is finally 

built according to |(4]). Let us remark than in Table El Step 1 is performed in two substeps: for each i, qi is 

first sampled conditional on (9 \ (<?j, Xj), z), and then Xi conditional on (0 \ Xi, z). Taken jointly, these two 

substeps perform the sampling of yi = (qi, Xi) conditional on (0 \ yi, z). 
Table 1 

Cheng et al.'s Gibbs sampler. 



© Let y = (q, 


x) . For each i = 1 . . . , M, 


draw y\ 


| (fc+i) (fe) h (k) (fe) A (fc) 


© draw h (k+v > 




© draw ai k+1 ^ 


^(fe+ij^tfe+i)^ 


© draw A (fc+1) 


| q (k+i) 


© draw cr^ fe+1 ^ 


\ h (k+i) 



Table 2 

A pseudocode for Cheng et al.'s Gibbs sampler (see [T] for implementation details). 



°/ Initialization 

h <- 0; h(round(P/2)) *- 1 
q <- 0; a; <- 

Sample a 2 ~ JG(1, 1); a 2 h ~ JG(1, 1) 
repeat 

Step 1: Sample yi = (qi,Xi) 

o\ ^*tvl/{*t+al\\h\\ 2 ) 

e *— z — He 

for i = 1 to M do 

< — e + hiXi % /li is the i-th column of H 
fii <- (of/of)^, 

^ <- A(cri/cr x ) exp (^f/(2cr 2 )) 
A 4 <— + 1 - A) 

Sample ~ Bi(\i)% step la 

Sample ~ J\f(lM, c 2 ) if <7i = 1, = otherwise °/„ stepib 
end for 

% Step 2: Sample h 

S < — <T e 2 X'X + a ^ 2 eye(P + 1) Inverse of covariance matrix R 

U <— chol(S) J{ Cholesky factor of S = R _1 , i.e., R = U" 1 U" t 

h <— U\ (cr £ ~ 2 U t \X t 2; + randn(P + 1, 1)) •/. h. ~ M(m, R) with m = o-^RX'z 

'/. Step 3: Sample 

Sample ct 2 ~ lG(JV/2 + 1, ||* - Ha;|| 2 /2 + 1) 

'/. Step 4: Sample A 

Sample A - Be(l + L, 1 + M - L), with L = J2 m 1m 

'/, Step 5: Sample cr 2 

Sample a\ ~ IG(P/2 + 1, ||h|| 2 /2 + 1) 
until Convergence 
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Despite its simplicity, this scheme presents several drawbacks. In [7], it is pointed out that time-shift 
ambiguities may lead to unreliable estimates, depending on the initialization of h. To make up for this 
deficiency, an additional Metropolis-Hasting (MH) step is proposed between Steps 1 and 2 in order to allow 
moves to a time-shifted version of (y, h). Another inefficiency in the Cheng et al.'s Gibbs sampler has been 
discussed in [8] with regard to the scale factor between h and x. A scale re-sampling step is proposed 
and tested to yield faster convergence. In Table O the modified version of Step 2 incorporates the time- 
shift operation and the scale re-sampling with an efficient implementation in which /gig represents the 
Generalized Inverse Gaussian law (see [7] and [8] for more details). In what follows, the resulting global 
scheme is referred to as a hybrid sampler, since there is a Metropolis Hastings step within the Gibbs scheme 
and no more a pure Gibbs sampler. 
Table 3 

Modified version of Step 2 to incorporate the time-shift compensation according to [7] and the scale re-sampling according 
to [8] . 



'/. Time-shift compensation (Labat et al . @ ) 

Propose y' = (q',x') from y = (q,x) with probability 
1 - 2i] if y' = y, 
n{v' I V) = { V if y' = (circshif t(q, 1), circshif t(x, 1)), 

r\ if y' = (circshif t(q, — 1), circshif t(x, — 1)), 
where i] € (0, 1/2), e.g., i] = 1/4. 

S < — (7 £ 2 X^X -)- (7^ eye(-P -f- 1) % Inverse of covariance matrix R. 

S' < — <T £ 2 (X')'X' + (T^ 2 eye(-P + 1) Inverse of covariance matrix R. 

U <- chol(S) 
U' <- chol(S') 

p - a-* \\(JJ'T\{X f yzf - ^ IIU^X^H 2 + 2log|U|/|U'| 

•/. p = (m'yCR')- 1 ™' - m t R -1 m + log|R'|/|R| 

if 2log(rand) < p then"/, with probability min{l,exp(p/2)} 

y y' 

h 4- U'\ (<T7 2 (U') t \(X') t z + randn(P + 1, 1)) 
else 

h <- U\ (a-^^z + randn(P + 1, 1)) 
end if 

"/„ Scale re-sampling (Veit et al . [{8j ) 

A^(£%-P-l)/2 
a<-\x\*/o*, P^\h\y*l 

S ~ /gig (Aj 01} 0} "/• simulation by acceptation-rejection 

x <— x ■ s, h <— h/s 



Despite the improvement brought by the timeshift and scale ressampling operations in the hybrid sampler, 
we have identified another source of inefficiency, completely independent from the time-shift and/or scale 
ambiguity problem. Figure Q] is an illustration based on simulated data generated from a single spike, 
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convolved with an IR defined by 

h(i) = cos ((i - 10)£) exp (— |0.225i - 2| 15 ) , i = 0, 



,20, 



which is depicted by bullets on Figure El[b) . A local optimal configuration of x (two neighboring spikes 
instead of a single one) is chosen as the initial state in Figure QJa). The hybrid sampler generates a Markov 
chain that typically takes several thousands of iterations (depending on the chosen random generator seed) to 
visit the optimal configuration for the first time. Figure[2]further illustrates that the hybrid sampler tends to 
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(a) Initial configuration. 
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(d) 2415th iteration 
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Fig. 1. Sampled BG sequences obtained by the hybrid sampler on a simple example. From a local optimal configuration chosen 
as initial state, the Markov chain spends several hundreds of iterations before visiting the solution, i.e., a unique spike in 
position 10 (marked as a bullet). 



produce unreliable estimated values. Here, Mendel's well-known BG sequence is adopted as a test signal [2]. 
The same IR is used and the data are corrupted by Gaussian noise with of = 4 x 10~ 6 , corresponding to 
a signal-to-noise ratio (SNR) of 12.80 dB. The three estimation results in Figure [2] are obtained from the 
same simulated data z, and the same initial states, the only difference being the seed value of the random 
generator. For each Markov chain, 1000 samples are produced and the last 250 are averaged to compute the 
estimation. Substantial variations exist from one estimated result to another, especially in the number and 
the positions of spikes in Figure EJa). Actually, it can be checked that each sequence {q^} tends to remain 
constant for many iterations, instead of exploring the state space efficiently. It should be stressed that these 
results are typical, i.e., they have not been selected on purpose. 

This phenomenon is basically due to Step 1 of the sampler. More precisely, the corresponding sampled 
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50 100 150 200 250 300 5 10 15 20 

(a) estimation of x (b) estimation of h 

Fig. 2. Three different solutions of the hybrid sampler [7], obtained by changing the random generator seed only. The actual 
values are marked as bullets. 



chains hardly escape from local optima of the posterior distribution, because the consecutive samples are 
highly dependent. 
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3. Generalized Gibbs on K-tuple variables 



This inefficiency in MCMC sampling schemes has been already noticed by Bourguignon and Carfantan 
in [3]. They propose a solution involving shifts of detected spikes to adjacent positions. However, such a 
solution is a deterministic procedure aimed at increasing the posterior probability during the burn-in period 
of the Markov chain. Therefore, it does not leave the posterior distribution invariant. In contrast, our goal 
here is to propose a valid sampling step that produces spike shifts. 

Let us introduce the set = {0, . . . , K — 1}, and the notation i + cj = {i + m \ m £ uj} for all w C O. 
For any vector v, let us denote v u the subvector formed by the entries v m , m £ uj, and the subvector 
formed by the remaining entries. Similarly, for any matrix V, we introduce the submatrices V u and 
formed by columns of V, indexed by uj in the first case, and gathering the remaining columns in the second 
case. 

The principle of the grouped Gibbs sampler to K-tuple variables consists in updating vector yi+n for each 
i = 1, . . . , M — K + 1 instead of the scalar updates of Step 1 in Table [TJ In particular, let us remark that 
such a joint update strategy allows a permutation of the Bernoulli vector <jf,+n (from a configuration (0, 1) 
to (1,0) in the case of K = 2) within one iteration. 

The resulting sampling scheme is given in TableSJ The notation of t/i+Q 1 ^ ( an d that of x 1 ^^^ 2 ^) denotes 

the fact that y( k+1 '> is latter re-sampled in step 2(a). Compared to the hybrid version of the original Gibbs 

sampler of Table [1] only Step 1 has been modified. It is clear that in the case of K = 1, we are driven back 

to the hybrid sampler. 
Table 4 

Grouped Gibbs sampler updating -fC-tuple variables 

© For each i = 1 . . . ,M - K + 1, 

(a) drawfliS^lyW^j.fcW.aW.AW.z 

(b) draw^lw^glS^hW,^),, 
© (a) draw y( fc+1 ) according to Table [3] 

(b) draw^ fe + 1 )|y( fc+1 ), ( T^ ) ,^ fc) ,2 

®draw C 7e (fc+1) |a : (fe+1) ,/i (fe+1) ^ 
©draw A^ fc+1 > | 

©draw4 fc+1) |fr (fe+1) 

It should be noted that the complexity of the if-tuple sampler increases rapidly (indeed, exponentially) 
with K, while larger values of K are expected to allow a more efficient exploration of the state space. 
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The latter fact is illustrated by Table which reports the number of iterations needed to visit the exact 

configuration of q for the first time, for different values of K and of the random generator seed. For K = 1, 

the required number of iterations varies in large amounts. Typically, several thousands of iterations are 

required. In contrast, the exact configuration of q is visited after a few tens of iterations for K = 2. For 

K = 3 or K = 4, the first time visit happens immediatly in most cases. 
Table 5 

Number of iterations for -K"-tuple samplers to visit the true configuration for the first time in the example of Figure [T] 



seed value 


K = 1 


K = 2 


K = 3 


K = 4 


19 


2415 


21 


4 


1 


29 


1558 


55 


1 


1 


39 


5137 


12 


1 


1 


49 


7132 


37 


3 


2 



In practice, the question remains to know what is the best trade-off between K = 1 or larger values of 
K. In the first case, the convergence is slow in terms of iteration number, but the computing time for each 
iteration is relatively low. By increasing values of K, the computing time per iteration increases rapidly, 
while the iteration number decreases. Finding the best trade-off on a purely theoretical and general basis 
is probably hopeless. More modestly, we examine this question in Section [5] on the sole basis of Mendel's 
example of Figure El However, a prerequisite is to rely on an optimized implementation of the K-tuple 
algorithm. This is the goal of the next subsection. 

3.1. Numerical implementation 

Steps 2 to 5 being identical to those of the hybrid sampler, only the algorithmic implementation of the 
Step 1 is detailed in Table El In the K-tuple sampler, Step 1 draws (qt+n, Xi+n) according to their joint 
conditional law: 

P(qi+U,Xi+n | rest) oc P{z \ x l+n ,x_ {l+n) ,h,cr^) P{x l+n \ q l+n ,a 2 x ) P(q i+ n | A) 

f 1 1 /A \ ltqi + n 

oc exp W eK '* ~ H-t+nXt+nW 2 j g(x i+ n; o^diag(q l+ o)) f y— ) (6) 

where ex.i = z — H_( i+ Q)a;_( i+ n) , so that en,i is a function of (x_u + fy,h, z) but not of X{ + q. 

Let u> C f2 be such that qi+u gathers the nonzero entries of q%+n, so that l^Qi+n = From J6]), it is 
straightforward to deduce that Step 1(b) amounts to draw a Gaussian law defined by x^^^ = and, if 
w 7^ 0, ~ Af(m i+u; , R,:+ w ), where 
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R 



L — er. 2 H' , ,,H; 



' (J™ I 



#<"! 



h: 



(7) 
(8) 



By integration of ((6|) with respect to cci+o, it is then possible to express the marginal conditional law of 
qi+n , along which the latter vector must be sampled according to Step 1 (a) : 

Pi+U 



where 



P(Qi+u = 1, 9i+nw = 0) = = 

Pl+LJ = or-^IRi^^^expjim^R^m^l 



A 



(9) 



Compared to a direct implementation using (7|)-([9]), several sources of computational saving can be found. 
The main one exploits the fact that both H and H'H are Toeplitz matrix in the "zero boundary" case of 
finite convolution. As a consequence, neither Hi +W , nor H' +w Hi +w , nor Ri+ W depend on the position i. A 
shorter notation where i + lu is replaced by ui will thus be adopted. More importantly, it becomes possible to 
precompute and store R" 1 and |R^| for all values of w. An even more efficient scheme manipulates Cholesky 
factors as detailed in Tabled such that U^U W = S^, U w being an upper-triangular matrix. 

Table 6 

Implementation of X-tuple sampling, Step 1(a), (b) in Tabled] 



for all nonempty u> C = {0, . . . , K — 1} do 

Lu, <- #w 

S w <- <j7 2 H^H w + a x 2 eye{L u ) I = RJ 1 

U w < — chol(S w )°/„ costs O(L^) since RJ 1 is Toeplitz 

a„ <- [U^]- 1 % =|R.| 1 / 2 
end for 

for i = 1 to M - A' + 1 do 



H 



(i+n) x -(i+n) 



for all nonempty w C = {0, . . . , K — 1} do 

p i+u <- CT x L -a w exp(||c w || 2 /2)(l/A- 1)" L - 
end for 

*/. Step 1(a) 

sample Qj+n according to ([9j) 

% Step 1(b): x i+UJ ~ A/"(th w ,R 1j ) 
<- U w \(c w + randn(X w , 1)) 

end for 



Using an implementation of Table El the iteration numbers of Tabled] can be converted in computer time. 
This confirms that despite the larger time taken per iteration, the K-tuple sampler may be faster for K > 1 
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than for K = 1. However, because of the exponentially increasing cost per iteration as a function of K, the 

best trade-off cannot be reached for large values of K. Here and after, we have limited our study to values of 

K not larger than 4. Some critical results of the deconvolution problem of the Mendel's sequence in Fig. [2] 

are reported in Tab. [71 including the overall computational costs using 4 samplers in the K-tuple family and 

the corresponding number of iterations until convergence. More complete simulation results are provided in 

Section [H 
Table 7 

Convergence rate of the K-tuple (K = 1, ... ,4) Gibbs scheme for the example in Fig. [2] measured in iteration numbers and 
time in seconds. Iteration numbers are noted in each case, e.g. 4600 for the simple hybrid sampler and 900 for the 2-tuple , etc. 





K = 1 


K = 2 


K = 3 


K = 4 


Number of iterations 


4600 


900 


700 


600 


Computation time (s) 


915 


202 


285 


452 



4. Partially marginal Gibbs sampler 

Here, another sampling scheme is proposed and studied, that indirectly tackles the same inefficiency by 
marginalizing out x in the step that samples q. 

4.1. Partial marginalization of x 

In principle, one alternative to the approach in Eq. (j4]) could be to generate a Markov chain on = 
(q, h, a2, (7^, A) marginalizing x, and then build the estimates in two steps: (1) the marginal maximum 
a posteriori estimate on each binary parameter qi and the posterior mean estimate on the continuous 
parameters \ q, (2) the posterior mean estimate of x conditional on (Q,z), that is a linear estimation 
problem. Furthermore, a Gibbs sampler with target distribution P(0 | z) is likely to be more efficient than 
the hybrid sampler, particularly with respect to the sampling of q. Theoretical foundations are available 
in [13] and [U Chapter 6.7] regarding the convergence rate of the so called collapsed Gibbs sampler, with the 
marginalized equilibrium distribution P(6 | z) = J P(0 | z)dx. According to [SI Chapter 6.7], the collapsed 
Gibbs sampler produces a substantial gain in terms of convergence rate if the marginalized parameters form 
highly dependent pairs with other parameters of the Markov chain (see also [H] and references therein). 
This is exactly the case of the pair {x, q} in the spike train deconvolution problem. Park et al. [Hj also 
illustrates through three examples that such a sampler must be implemented with care to be sure that the 
desired stationary distribution is preserved. 
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However, marginalizing x from the posterior distribution by P(G | z) = J P(0 | z)dx leads to practical 
difficulties. Technically, the marginalization of x is only feasible on conditional laws of (q, A, o^). On the con- 
trary, the conditional sampling of h becomes extremely difficult if x is integrated out, since P(h \ q, a^^cr^z) 
is a multivariate, non Gaussian law with a complex structure. The same observation is made on the sampling 
of of if x is marginalized out. Instead of a plain collapsed Gibbs sampler on 9, the sampling scheme in 
Table [8] is proposed to circumvent the direct conditional sampling of h and of . Compared with the hybrid 
sampler, the main difference appears in Step 1, where x has been analytically integrated out. Subsection 14.21 
provides an efficient way to implement Step 1. 
Table 8 

The proposed partially marginal sampler. 

© (a) for each i = 1 . . . , M, 

draw q r i,2) I , «SU , ^ , , A« , z 

(b) draw x^/a) | q (k+m ) h (k) ( ^k) ) z 

© (a) draw y( fe+1 ) according to Table [3] 

(b) draw | x^ 1 ),^,^^ 

©drawa? +1) | : E(' £ + 1 ),/i( fe + 1 ),^ 
©draw A (fe+1) | q (k+1) 
© draw a^ 1 ' | 



We give here a mathematical justification that the proposed partially marginal sampler identifies with a 
Gibbs sampler of the joint posterior law (Eq. |(5]), that includes x), for a particular scanning scheme. The 
techniques are similar to those denoted as marginalization and trimming in |15| . 

Let (qi, x), (#2, x), . . . , (qM,x),h, a 2 t , A, a\ be the scanning sequence of the Gibbs sampler. For the sam- 
pling of couples (qi, x), a two-stage procedure is considered by applying the Bayes rule P(x, qi \ ®\{x, qt}, z) = 
P(q i \Q\{x,q i },z)P(x\e\x,z): 

- draw qi according to P(g, | qui-i,qi+i-.M,h, of, A, z); 

- draw x according to P(x \ q, h, of, A, z). 

The second stage is therefore repeated M times within each iteration of the sampler, while all but the 
last one are useless since the correspondingly sampled values of x do not enter any subsequent operation. 
The M — 1 corresponding sampling operations can thus be skipped and the resulting sequence of operations 
coincides exactly with the partially marginal sampler of Table [H Another mathematical proof is given in [10] 
by considering Table [8] as a collapsed Gibbs sampler with target distribution P(0 | z) and then checking the 
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invariant condition of the related Markov chain. 

In the same conditions than that of Table \E[ the partially marginal sampler takes about 20 iterations. 
Figure [3] illustrates the way it escapes from a local optimal configuration within an acceptable number of 
iterations in the example of Figure [U The true configuration is reached after 20 iterations. Moreover, it is 
observed from the configurations obtained at the 18th and 19th iterations that our scheme is able to radically 
modify x in one single step, a characteristic directly related to the marginalization of x in Step 1(a) of the 
sampler. 



0.4 
0.2 


-0.2 



0.4 
0.2 


-0.2 



10 

(a) 1st iteration 



15 



10 15 
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Fig. 3. Sampled BG sequences obtained by our partially marginal scheme in the example of Figure[T] The Markov chain escapes 
rapidly the initial configuration neighborhood. 



4.2. Marginal posterior distribution 

In this subsection, a numerical implementation of Step 1 of Table [8] is proposed, inspiring from the 
recursive method in [4] to evaluate P{qi = 0, l|0\g,; z) sequentially. While the latter is based on the storage 
and update of an L x L (L = J^i Qi) matrix, we introduce an even less burdensome strategy by handling 
the Cholesky factor matrix. The issue of the complexity per iteration is dealt here, whereas the overall 
computational costs are compared in Section [5] using simulation tests. 

First, the conditional posterior distribution of qi takes the following form, as detailed in [4]: 

P( ?i |0 \ q t - z) <x |B|-Va exp f-^B" 1 *) (j^j « exp (-/(g,)/2) 
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where 



B = Hdiag(q)H t + o*I N , 
f( qi ) = z'B-'z + log |B| + 2ft log(l/A - 1). 

After normalization, the marginal probability of ft reads: 

P(ft | 9 \ ft,*) = (1 + exp (-(/(l - qi ) - /(ft))/^)" 1 , 

which reduces to the evaluation of /(l — ft) — /(ft), using the following steps [4]: 



B 4 1 = B 1 - a e 2 B ^-Bg 1 

IBt^IBo 1 !^ 



(10) 
(11) 



(12) 
(13) 
(14) 



with B = B/ct 2 . Si = ±1 depends on whether 1 is added or removed at ft and B^ and Bo differ only at ft. 
Let us note that Bo,Bi > 0, so that it can be deduced from lfT4|) that has the same sign as Si. 

Further simplifications are also introduced in [4] by exploiting the sparse nature of B and applying the 
matrix inversion lemma. Noticing that the rank of Hdiag(<7)H* is only L, B takes the alternate form B = 
a~ 2 GG t + 1, where G = HD is full rank, and D made of the nonzero columns of diag(q). By applying the 
matrix inversion lemma, we have 



B 1 =I-a7 2 GC 1 G t 



(15) 



where C = a~ 2 G t G + I is a L x L matrix. Therefore, C _1 (L x L) can be stored and updated instead of B _1 
(N x N). In the case of adding a spike at ft, the formula for the update of C" 1 up to a matrix permutation 
operation is the following: 



6=-a e -V 1 Co 1 G t ^ 



1 + br^ b 



(16) 
(17) 



The case of removing a spike is straightforward since we have the following: 
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C7 1 + brtf b 



(18) 



Notice that in the case of spike removal, b and n are extracted from Cg 1 rather than calculated as in 
Eq. lfl6l) . These operations are invariant up to a matrix permutation in cases of adding/removing a spike at 
an arbitrary location. 

Finally, we propose to further reduce the computation and memory load, by updating and storing the 
Cholesky factor F of C _1 (F*F = C _1 for an upper-triangular matrix F). It is derived from Equations (fl3| 
and JTH that: 



= 5i 



- 2 ||M 2 -a e - 4 ||F G M 2 , 



z*Bo % = z*hi - ( 7 e - 2 (F G*z) t (F G^ i ) 

and from Eq. JIT]) and HU) that: 

/(I - Qi) - f(Qi) = log(£m) - a^rr^z'B^hi) 2 + 25. log (i - 1 

Two extra advantages of the Cholesky decomposition are: 
- since the conditional law in Step 1(b) reads 



(19) 



{x\®,z) - A/V^C^G^CT 1 ) 



(20) 



sampling x costs 0(L 2 ) using F (See Table [9]); 
- the incremental update of F also costs 0{L 2 ) using a rank-l Cholesky update method. The examples 
of adding and removing a spike at the last location is given in the following for notational simplicity. 
It is however detailed in the appendix that unlike adding a spike at at an arbitrary location, the case 
of removing a spike at an arbitrary location requires two operations of Cholesky rank-l update. 
For adding a spike at the last location, n > and equation $T7\ also reads: 



C" 1 



o l 







o l 







u 1/2 
^-!/2 



u I/ 2 
^-V2 



(21) 



In the case of removing a spike at the last location, from Eq. (fl8|) and the definition of Cholesky decompo- 
sition, we have: 
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F'F; + bTjb b 



(22) 



'0 = 



S a 



(T m 



where So is also an upper triangular matrix. It is then directly deduced that: 



(23) 



S a = b 



a*a + m 2 = r, 1 



S S = F*Fi + bnb' 



(24) 
(25) 
(26) 



Let cholupdate(A, d,' ±') be the method that updates in 0{L 2 ) the Cholesky factor of A* A ± dd t , where 

1 /2 

A is already an upper triangular matrix. Then, Fj = cholupdate(So, ,' — ). 



5. Simulation results 

5.1. Convergence diagnostic 

In order to compare empirically the convergence speed of the different samplers, we have resorted to 
Brooks and Gelman's iterated graphical method to assess convergence [32] ■ This diagnostic method is based 
upon the covariance estimation of to independent Markov chains {$jt, j = 1, . . . , to; t = 1, . . . , n} of equal 
length n. Let (respectively, $..) denote the local (respectively, global) mean of the chains. The intra-chain 
and inter-chain variances are defined as covariance matrix averages: 

^ m n 

V ' 0=1 t=l 



1 m 



that characterize the convergence behavior. Brooks and Gelman [12] proposed to evaluate the multivariate 
potential scale reduction factor (MPSRF): 

11 ' — 1 711 j • 1 

MPSRF = + A(Vr t 1 ra V inter ) 
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Table 9 

Step 1 of the partially marginal sampling scheme. The update of F is given here up to a matrix permutation. 



Initialize F and G 




°/ Step 1(a): sequential sampling of qi 




for i = 1 to M do 








n <- dj + cr e z ||Ai|| — ct £ ||FG n.j|| 




q> <— z l hi — a e ^(FG z) (FG /ij) •/. </> = z*B/i 
A/ <- log^r*) - a e 4 r- V + 25, log (1/A - 1) 




1 see Eq.lfl9ll 


sample U ~ £/([0, 1]) °/„ Uniform distribution in [0,1] 




if u > (1 + exp(— A//2J) then 




9i <- <Zi + 0; /„ <3i + dj is accepted 




if 5i — 1 then 




I — 9 — 1 Tit "n* — 1 1 i .. i i 

b < (7 e Z T { 1 F t FG t /li */. see Eq. J16J 




, r 1/9 l/9n. 

F <- cholupdate([F 0:0' 0], [br-' ;r i 1 ]/H 


-') see Eq.pTjl 


else % the case of (5^ = — 1 removing a spike, see 


appendix for detail 


L <— sum(q) 




b <- F t F(:,i),u <- 6_ l ,r- 1 <- b 4 




if £ < L then % other than the last location, 


an extra update needed 


e <- F(i,l + « : £)* 




F(l + i : L, 1 + « : L) <- cholupdate(F(l 4 


-i : L,l + i : L),e,' +') %0((L-i) 2 ) 


end if 




F(i, :) < — |],F(:,i) < — [1 % Eliminate the ith row 


and column 


F <- cholupdate(F, t 1 / 2 v,' -') •/, 0{L 2 ) 




end if 




G <- H(:,find(<?)) 




end if No change of (qi,F) otherwise 




end for 




°/ Step Kb): sampling of x 




x ^0 




L *— sum(<7) 




x(f ind(q)) <- F* (cr^FG^ + randn(L, 1)) •/. see 


Eq. {20j 



where A(-) returns the largest eigenvalue of the covariance matrix. Convergence is diagnosed when MPSRF is 
close to one (e.g., MPSRF < 1.2 as proposed in [12]). In order to get a graphical evolution of the convergence, 
each chain is divided into batches of b samples, and the MPSRF is calculated upon the second halves of the 
Markov chains {&jt}, t = 1, . . . , kb of increasing lengths kb, while the first halves are regarded as a burn-in 
period. From an empirical basis, it is also advised to select b rts n/20 in [12] . 



5.2. Simulation tests 



A test scenario is designed here to compare the generalized K-tuple Gibbs sampler (including the classical 
hybrid sampler in the case K = 1) of Section [3] and the partially marginal sampler of Section HJ in terms 
of robustness with respect to different random initial conditions. Time-shift and scale ambiguities are taken 
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into account in all the methods, as described in Section [231 Let us reexamine the example of the benchmark 
sequence (Mendel's sequence, for which M = 300 and A = 0.1.) in Figure [H To concentrate on the conver- 
gence quality of the Bernoulli sequence, we evaluate the MPSRF evolutions of {q^ k '} for ten independent 
Markov chains, i.e., m = 10. Figured] is compares the convergence rate of -ftT-tuple samplers for different 
values of K: while MPSRF falls under 1.2 after 515 seconds of simulation for the classical hybrid sampler 
(K = 1), this threshold is reached after about 285 seconds for the 3-tuple sampler and in less than 202 
seconds for the 2-tuple sampler. It is interesting to observe that both the 3 and 4-tuple sampler have a more 
stable convergence behavior in the convergence zone while the hybrid sampler presents the most oscillating 
MPSRF values. These results confirm the discussion in Section [31 augmenting K improves the convergence 
quality in terms of iteration numbers while in the meantime increases the computational load per iteration. 
The overall performance relies on the trade-off between the two criteria. Thus, the best compromise in the 
iC-tuple sampler series is reached for K = 2 in the given example. 




100 200 300 500 1000 

time in seconds (in log-scale) 

Fig. 4. Evolution of MPSRF for the K -tuple sampler family (K = 1,2,3) in the case of Mendel's sequence (SNR = 12.80 dB). 

Next we compare the convergence diagnostic results on the 2-tuple sampler and the partially marginal 
sampler. Each MPSRF in Figure [5] is evaluated for every b = 100 iterations. Altogether 300 iterations 
(116 seconds of simulation) are required for the partially marginalized sampler to reach convergence, not to 
mention the overall lower level of MPSRF in the convergence zone compared with the 2-tuple sampler. Thus 
the partially marginalized sampler takes fewer iterations and less time to converge than the best selected 
if -tuple sampler in Figure [H We also point out that the asymptotic complexity per iteration of the partially 
marginal sampler is lower than that of the 2-tuple sampler in the given simulation example, though the 
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contrary is observed for the first 100 samples (corresponding to the heating period) in which the number of 
detected spikes L is relatively important. 




□ partially marginal sampler 
+ K-tuple (K = 2) sampler 



100 



200 



300 



400 



500 



time in seconds 



Fig. 5. Evolution of MPSRF for the partially marginal sampler and the 2-tuple sampler in the case of Mendel's sequence. 
MPSRF has been computed every 100 samples for both methods. 

To further illustrate its robustness, deconvolution results obtained by the partially marginal sampler are 
shown in Figure [6] under identical initial conditions as in Figure [2l Only one of the results is reported, as 
they are undistinguishable from each other (perfectly consistent estimations). And this is true for all the 
samplers once their MPSRF values drop below the 1.2 threshold. 
0.2 r 

f m 

0.15 




50 100 150 200 250 300 

estimation of x 



5 10 15 

estimation of h 



Fig. 6. Estimation result using the proposed sampler, under the same conditions as in Figure [2] The 10 independent Markov 
chains yield undistinguishable estimates. 

In the given example of the Mendel's sequence (for which M = 300), the partially marginal sampler 
achieves a better compromise between the two criteria [5] than the most performant sampler in the gen- 
eralized if-tuple family: the time required to reach convergence is reduced by half. However, the partially 
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marginal sampler do not necessarily outperform the if -tuple sampler in dealing with longer datasets. Notice 
that the computational complexity for a Cholesky update operation in the latter case is 0(L 2 ) (L = J2li ~ 
AM, the number of detected spikes), thus proportional to the data length of the BG sequence, whereas the 
complexity per iteration of the if -tuple sampler is non-increasing with respect to the number of detected 
spikes L. To illustrate the disadvantage of the partially marginal sampler, it is necessary to test on simula- 
tion examples of longer observation data z while fixing the Bernoulli parameter A. Figure [7] compares the 
complexity per iteration as a function of BG sequence length M for the two algorithms. The 2-tuple sampler 
potentially outperforms the partially marginal sampler due to its linear growth of complexity per iteration 
with respect to M. Figure [8] resumes the convergence time for all the 5 samplers (K = 1, ... ,4 in solide 
lines and the partially marginalised sampler in dashed line) on simulation data whose lengths vary from 100 
to 1600 and A = 10%. While the partially marginal sampler outperforms all the rest algorithms in areas 
of shorter data length (M < 800), its quadratic computation cost per iteration as shown in Fig. [7] yields 
the contrary in applications with longer data lengths. We also note that for the given bernoulli parameter 
A = 10%, the hybrid sampler is never a good choice inside the if-tuple family regardless of the data length. 




500 1000 1500 2000 2500 3000 

spike train length M 



Fig. 7. Computation cost per iteration comparison between the 2-tuple Gibbs sampler and the partially marginal sampler. The 
two methods are tested on the same simulated observation data z, for which the length of the spike train ranges from 100 to 
3000 and the Bernoulli parameter A is fixed to 10%. Polynomial interpolations (of degree 1 and 2 respectively) are traced to 
show the quasi-linear and -quadratic evolution of complexity per iteration with respect to the spike train length M . 
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100 200 400 800 1600 

spike train length M (in log-scale) 

Fig. 8. Convergence time comparison for the 5 samplers applied on spike trains of increasing length. 
6. Conclusion 

This paper proposes two distinct methods in dealing with the identified inefficiency concerning the 
Bernoulli label q sampling in the BG deconvolution problem. Detailed algorithms are given for both methods 
as well as their performance on simulation tests. It is shown that both methods are mathematically valid 
MCMC sampling schemes and both achieve better convergence properties in comparison to the hybrid sam- 
pler, the latter is based on Cheng et al.'s Gibbs scheme with time-shift compensations and scale re-sampling. 
Both proposed methods demonstrate better trade-off for the convergence rate by reducing drastically the 
iteration steps needed to converge. On the simulation test of Mendel's sequence, the partially marginal sam- 
pler is shown to out-perform the whole class of K-tuple samplers, among which the optimized parameter is 
found at K = 2. This is however due to the relatively small size of the simulation problem and the contrary 
is observed by simply augmenting the simulated data length. 

We conclude with some extensions of the proposed techniques. By combining the idea of a K-tuple joint 
sampling and that of partial marginalization of x, i.e., updating tfo+n | Q-(i+Q),h, a e , A in Step 1 of Tabled 
one might further reduce the number of iterations necessary to escape from local optima at the cost of higher 
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computation load per iteration. Such a solution should be adopted when the deconvolution problem presents 
more difficulties, characterized by either low SNR, band-limited IR and/or less sparse spike trains. 



APPENDIX To update the Cholesky factor R on removing a spike at the location i such that K'K = 
R*R — mi', we specify : 





A b C 




G H 


R = 


S e* 


,K = 







OOF 




J 



v =V?R t -R(:,i) 

(A-2) 

t- 1 = b l b + S 2 

where {A,F, G,J} are upper-triangular matrices and S = Hi,i the ith diagonal value of R. (K-^-i) is 
therefor the updated Cholesky factor. 

A most evident solution consists an operation K = Cholupdate(R, v' — ') followed by an extraction to 
get the updated Cholesky factor (of lowered dimension) : 

G H 

J 

which however systematically fails due to the fact that K*K is no longer definite positive. In fact, one 
necessary condition [16] for a successful rank-1 downdate operation (update by subtracting t)t)') is that 
p 2 = 1 — p'p > for which p is defined by 

R'p = v. (A-3) 

Without inverting R', it can be deduced from (j A-2[) that p = y / rR(:, j) and thus p 2 = 0. The algorithm of 
Cholupdate by a series of Givens rotation should break down according to [16] . 

The problem of non-definite positiveness is avoided by a strategy that first lower the dimension and then 
perform the Cholupdate algorithm. From l|A-l[) . we obtain: 
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A* A 


A' 6 


A*C 




G*G 





G*H 




6'A 


r" 1 


b t C + 5e l 















C'A 


C'b + Je 


C'C + ee^F'F 




H'G 





H l H + J' J 



vv 



The submatrices on both sides by eliminating the ith row and column yields 



A* A 


A'C 




G*G 


G'H 


C*A 


C*C 


f ee t + F*F 




H*G 


H*H + J' J 



V-iV_, 



for which V-i denotes v without the ith component. The following relation is key to the algorithm that we 
propose : 



A C 


t 


A C 


+ 










t 


G H 


t 


G H 


F 




F 




e 




e 




J 




J 



The two steps that leads to a spike removing update are : 

(i) an update on F (of reduced dimension): 

M = Cholupdate(F,e,'+'); 

(ii) a downdate that keeps the positiveness of the product matrix and will not fail: 



(A-4) 



G H 

J 



Cholupdate( 



A C 
M 



it can be proven that p 2 = 1 — p t p is strictly positive in this case. 
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